
\section{Fitting Lines -- Regression}
\label{sec:linefitting}

{
\pgfkeys{
    /pgfmanual/gray key prefixes={/pgfplots/table},
}

This section documents the attempts of \PGFPlots{} to fit lines to input
coordinates. \PGFPlots{} currently supports |create col/linear regression|
applied to columns of input tables. The feature relies on \PGFPlotstable{}, it
is actually implemented as a table postprocessing method.

\begin{stylekey}{/pgfplots/table/create col/linear regression=\marg{key-value-config}}
\pgfkeys{
    /pgfmanual/gray key prefixes={/pgfplots/table/create col/linear regression/},
    /pdflinks/search key prefixes in/.add={/pgfplots/table/create col/linear regression/,}{},
}
    A style for use in |\addplot table| which computes a linear (least squares)
    regression $y(x) = a \cdot x + b$ using the sample data $(x_i,y_i)$ which
    has to be specified inside of \meta{key-value-config} (see below).

    It creates a new column on the fly which contains the values $y(x_i) = a
    \cdot x_i + b$. The values $a$ and $b$ will be stored (globally) into
    \declareandlabel{\pgfplotstableregressiona} and
    \declareandlabel{\pgfplotstableregressionb}.

\begin{codeexample}[]
\begin{tikzpicture}
    \begin{axis}[legend pos=outer north east]
        \addplot table {% plot X versus Y. This is original data.
            X Y
            1 1
            2 4
            3 9
            4 16
            5 25
            6 36
        };
        \addplot table [
            y={create col/linear regression={y=Y}}, % compute a linear regression from the input table
        ] {
            X Y
            1 1
            2 4
            3 9
            4 16
            5 25
            6 36
        };
        %\xdef\slope{\pgfplotstableregressiona} %<-- might be handy occasionally
        \addlegendentry{$y(x)$}
        \addlegendentry{%
            $\pgfmathprintnumber{\pgfplotstableregressiona} \cdot x
            \pgfmathprintnumber[print sign]{\pgfplotstableregressionb}$}
    \end{axis}
\end{tikzpicture}
\end{codeexample}
    %
    The example above has two plots: one showing the data and one containing
    the |linear regression| line. We use |y={create col/linear regression={}}|
    here, which means to create a new column\footnote{The \texttt{y=\{create
    col/} feature is available for any other \PGFPlotstable{} postprocessing
    style, see the \texttt{create on use} documentation in the \PGFPlotstable{}
    manual.} containing the regression values automatically. As arguments, we
    need to provide the $y$ column name explicitly.\footnote{In fact,
    \PGFPlots{} sees that there are only two columns and uses the second by
    default. But you need to provide it if there are at least 3 columns.} The
    $x$ value is determined from context: |linear regression| is evaluated
    inside of |\addplot table|, so it uses the same $x$ as |\addplot table|
    (i.e.\@ if you write |\addplot table[x=|\marg{col name}|]|, the regression
    will also use \meta{col name} as its |x| input). Furthermore, it shows the
    line parameters $a$ and $b$ in the legend.

    Note that the uncommented line with
    |\xdef\slope{\pgfplotstableregressiona}| is useful if you have \emph{more
    than one} regression line: it copies the value of
    |\pgfplotstableregressiona| (in this case) into a new global variable
    called `|\slope|'. This allows to use `|\slope|' instead of
    |\pgfplotstableregressiona| -- even after |\pgfplotstableregressiona| has
    been overwritten.

    The following \meta{key-value-config} keys are accepted as comma-separated
    list:

    \begin{key}{%
        /pgfplots/table/create col/linear regression/table=%
            \marg{\textbackslash macro {\normalfont or} file name} (initially empty)%
    }
        Provides the table from where to load the |x| and |y| columns. It
        defaults to the currently processed one, i.e.\@ to the value of
        |\pgfplotstablename|.
    \end{key}

    \begin{keylist}{%
        /pgfplots/table/create col/linear regression/x=\marg{column} (initially empty),
        /pgfplots/table/create col/linear regression/y=\marg{column} (initially empty)%
    }
        Provides the source of $x_i$ and $y_i$ data, respectively. The argument
        \meta{column} is usually a column name of the input table, yet it can
        also contain |[index]|\meta{integer} to designate column indices
        (starting with $0$), |create on use| specifications or |alias|es (see
        the \PGFPlotstable{} manual for details on |create on use| and
        |alias|).

        The initial configuration (an empty value) checks the context where the
        |linear regression| is evaluated. If it is evaluated inside of
        |\pgfplotstabletypeset|, it uses the first and second table columns. If
        it is evaluated inside of |\addplot table|, it uses the same $x$ input
        as the |\addplot table| statement. The |y| key needs to be provided
        explicitly (unless the table has only two columns).
    \end{keylist}

    \begin{keylist}{%
        /pgfplots/table/create col/linear regression/xmode=\mchoice{auto,linear,log} (initially auto),
        /pgfplots/table/create col/linear regression/ymode=\mchoice{auto,linear,log} (initially auto)%
    }
        Enables or disables processing of logarithmic coordinates. Logarithmic
        processing means to apply $\ln$ before computing the regression line
        and $\exp$ afterwards.

        The choice |auto| checks if the column is evaluated inside of a
        \PGFPlots{} axis. If so, it uses the axis scaling of the embedding
        axis. Otherwise, it uses |linear|.

        In case of logarithmic coordinates, the |log basis x| and |log basis y|
        keys determine the basis.

\begin{codeexample}[]
\begin{tikzpicture}
\begin{loglogaxis}
 \addplot table [x=dof,y=error2]
    {pgfplotstable.example1.dat};
  \addlegendentry{$y(x)$}

 \addplot table [
      x=dof,
      y={create col/linear regression={y=error2}},
 ] {pgfplotstable.example1.dat};

  % might be handy occasionally:
  %\xdef\slope{\pgfplotstableregressiona}
 \addlegendentry{slope
   $\pgfmathprintnumber{\pgfplotstableregressiona}$}
\end{loglogaxis}
\end{tikzpicture}
\end{codeexample}

        The (commented) line containing |\slope| is explained above; it allows
        to remember different regression slopes in our example.
    \end{keylist}

    \begin{keylist}{%
        /pgfplots/table/create col/linear regression/variance list=\marg{list} (initially empty),
        /pgfplots/table/create col/linear regression/variance=\marg{column name} (initially empty)%
    }
        Both keys allow to provide uncertainties (variances) to single data
        points. A high (relative) variance indicates an unreliable data point,
        a value of $1$ is standard.

        The |variance list| key allows to provide variances directly as
        comma-separated list, for example

        |variance list={1000,1000,500,200,1,1}|.

        The |variance| key allows to load values from a table \meta{column
        name}. Such a column name is (initially, see below) loaded from the
        same table where data points have been found. The \meta{column name}
        may also be a |create on use| name.
        %
\begin{codeexample}[]
\begin{tikzpicture}
\begin{loglogaxis}
 \addplot table [x=dof,y=error2]
    {pgfplotstable.example1.dat};
  \addlegendentry{$y(x)$}

 \addplot table [
      x=dof,
        y={create col/linear regression={
            y=error2,
            variance list={1000,800,600,500,400},
        }}
 ] {pgfplotstable.example1.dat};

 \addlegendentry{slope
  $\pgfmathprintnumber{\pgfplotstableregressiona}$}
\end{loglogaxis}
\end{tikzpicture}
\end{codeexample}

        If both, |variance list| and |variance| are given, the first one will
        be preferred. Note that it is not necessary to provide variances for
        every data point.
    \end{keylist}

    \begin{key}{/pgfplots/table/create col/linear regression/variance src=\marg{\textbackslash table {\normalfont or} file name} (initially empty)}
        Allows to load the |variance| from another table. The initial setting
        is empty. It is acceptable if the |variance| column in the external
        table has fewer entries than expected, in this case, only the first
        ones will be used.
    \end{key}

    \begin{key}{/pgfplots/table/create col/linear regression/variance format=\mchoice{linear,log} (initially log)}

		The default configuration assumes that variance is already given in logarithmic coordinates and might prove to be unsuitable for exponential functions. Use |variance format=linear| in order to map the variance to log coordinates explicitly. This applies even if no variance is specified:

\begin{codeexample}[code only]
% file plotdata/approx_exp.dat
n  eapprox
1  97.71828182845904
2  101.38905609893065
3  87.08553692318768
4  92.59815003314424
5  194.4131591025766
6  410.4287934927351
7  1174.6331584284585
8  3032.9579870417283
9  8198.083927575384
10 22124.465794806718
11 59973.14171519782
\end{codeexample}
        
\begin{codeexample}[]
\begin{tikzpicture}
\begin{axis}
	\addplot table[x=n,y=eapprox]
		{plotdata/approx_exp.dat};
	\addplot[no markers,red,thick] 
		table [x=n,y={create col/linear regression={
			y=eapprox,ymode=log,
			variance format=linear}}]
		{plotdata/approx_exp.dat};
\end{axis}
\end{tikzpicture}
\end{codeexample}

	The option can be specified to the axis environment as well:
\begin{codeexample}[]
\begin{tikzpicture}
\begin{axis}[
  ymode=log,
  table/create col/linear regression/variance format=linear
]
	\addplot table[x=n,y=eapprox]
		{plotdata/approx_exp.dat};
	\addplot[no markers,red,thick] 
		table [x=n,y={create col/linear regression={
			y=eapprox,ymode=log}}]
		{plotdata/approx_exp.dat};
\end{axis}
\end{tikzpicture}
\end{codeexample}

    \end{key}
\end{stylekey}


\paragraph{Limitations:}

Currently, \PGFPlots{} supports only linear regression, and it only supports
regression together with |\addplot table|. Furthermore, long input tables might
need quite some time.

}
